Abstract. We consider the setting of stochastic bandit problems with a
continuum of arms indexed by $[0,1]^d$. We first point out that the strategies
considered so far in the literature only provided theoretical guarantees of
the form: given some tuning parameters, the regret is small with respect to a
class of environments that depends on these parameters. This is however not
the right perspective, as it is the strategy that should adapt to the specific
bandit environment at hand, and not the other way round. Put differently, an
adaptation issue is raised. We solve it for the special case of environments
whose mean-payoff functions are globally Lipschitz. More precisely, we show
that the minimax optimal orders of magnitude $L^{d/(d+2)} T^{(d+1)/(d+2)}$ of
the regret bound over $T$ time instances against an environment whose
mean-payoff function $f$ is Lipschitz with constant $L$ can be achieved
without knowing $L$ or $T$ in advance. This is in contrast to all previously
known strategies, which require to some extent the knowledge of $L$ to achieve
this performance guarantee.



1 Introduction 

In the (stochastic) bandit problem, a gambler tries to maximize the revenue 
gained by sequentially playing one of a finite number of arms that are each 
associated with initially unknown (and potentially different) payoff
distributions [Rob52]. The gambler selects and pulls arms one by one in a sequential
manner, simultaneously learning about the machines' payoff-distributions and 
accumulating rewards (or losses). Thus, in order to maximize his gain, the gam- 
bler must choose the next arm by taking into consideration both the urgency of 
gaining reward ("exploitation") and acquiring new information ("exploration"). 
Maximizing the total cumulative payoff is equivalent to minimizing the (total) 
regret, that is, minimizing the difference between the total cumulative payoff of 
the gambler and that of another clairvoyant gambler who chooses the arm with 
the best mean-payoff in every round. The quality of the gambler's strategy can 
be characterized by the rate of growth of his expected regret with time. In par- 
ticular, if this rate of growth is sublinear, the gambler in the long run plays as 
well as his clairvoyant counterpart. 

Continuum-armed bandit problems. Although the early papers studied
bandits with a finite number of arms, researchers soon realized that bandits with
infinitely many arms are also interesting, as well as practically significant. One
particularly important case is when the arms are identified by a finite number
of continuous-valued parameters, resulting in online optimization problems over
continuous finite-dimensional spaces. During the last decades numerous
contributions have investigated such continuum-armed bandit problems, starting from
the early formulations of [Agr95, Cop09, Kle04] to the more recent approaches of
[AOS07, KSU08, BMSS11]. A special case of interest, which forms a bridge
between the case of a finite number of arms and the continuum-armed setting,
is the problem of bandit linear optimization, see [DHK08] and the references
therein.

Not the right perspective! We call an environment $f$ the mapping that
associates with each arm $x \in \mathcal{X}$ the expectation $f(x)$ of its associated
probability distribution. The theoretical guarantees given in the literature mentioned above
are of the form: given some tuning parameters, the strategy is competitive, and 
sometimes even minimax optimal, with respect to a large class of environments 
that unfortunately depends on these parameters. But of course, this is not the 
right perspective: it is the strategy that should adapt to the environment, not 
the other way round! 

More precisely, these parameters describe the smoothness of the environments
$f$ in the class at hand in terms of a global regularity and/or local regularities
around the global maxima of $f$. The issues raised by some of the works mentioned
above can be roughly described as follows: 

- The class of environments for the CAB1 algorithm of [Kle04] is formed by
  environments that are $(\alpha, L, \delta)$-uniformly locally Lipschitz and the
  strategy CAB1 needs to know $\alpha$ to get the optimal dependency in the number
  $T$ of arms pulled;

- For the Zooming algorithm of [KSU08], it is formed by environments that
  are 1-Lipschitz with respect to a fixed and known metric $L$;

- The HOO algorithm of [BMSS11] basically needs to know the pseudo-metric
  $\ell$ with respect to which $f$ is weakly Lipschitz continuous, with Lipschitz
  constant equal to 1;

- Other examples include the UCB-air algorithm (which relies on a smoothness
  parameter $\beta$, see [WAM09]), the OLOP algorithm (smoothness parameter
  $\gamma$, see [BM10]), the LSE algorithm (smoothness parameter $C_L$, see [YM11]),
  the algorithm presented in [AOS07], and so on.

Adaptation to the unknown smoothness is needed. In a nutshell, adaptive
methods are required. By adaptive methods, we mean (as is done in the
statistical literature) agnostic methods, i.e., with minimal prior knowledge about $f$,
that nonetheless obtain almost the same performance against a given
environment $f$ as if its smoothness were known beforehand.

More precisely, given a fixed (possibly vector-valued) parameter $L$ lying in
a set $\mathcal{L}$ and a class of allowed environments $\mathcal{F}_L$, where
$L \in \mathcal{L}$, existing works present algorithms that are such that their
worst-case regret bound over $T$ time steps against environments in $\mathcal{F}_L$,

$$ \sup_{f \in \mathcal{F}_L} R_T(f) \leq \varphi(T, L), $$

is small and even minimax optimal, i.e., such that it has the optimal dependencies
on $T$ and $L$. However, to do so, the knowledge of $L$ is required. In this work, we
are given a much larger class of environments
$\mathcal{F} = \cup_{L \in \mathcal{L}} \mathcal{F}_L$ and our goal is an
algorithm that adapts in finite time to every instance $f$ of $\mathcal{F}$, in the
sense that for all $T$ and $f \in \mathcal{F}$, the regret $R_T(f)$ is at most of the
order of $\min \varphi(T, L)$, where the minimum is over the parameters $L$ such
that $f \in \mathcal{F}_L$.

Since we are interested in worst-case bounds, we will have to consider
distribution-free bounds (i.e., bounds that only depend on a given class
$\mathcal{F}_L$); of course, the orders of magnitude of the latter, even when they
are minimax optimal, are often far away (as far as the dependencies in $T$ are
concerned) from those of distribution-dependent bounds (i.e., bounds that may
depend on a specific instance $f \in \mathcal{F}_L$).

Links with optimization algorithms. Our problem shares some common
points with the maximization of a deterministic function $f$ (but note that in our
case, we only get to see noisy observations of the values of $f$). When the
Lipschitz constant $L$ of $f$ is known, an approximate maximizer can be found with
well-known Lipschitz optimization algorithms (e.g., Shubert's algorithm). The
case of unknown $L$ has been studied in [JPS93, Hor06]. The DIRECT algorithm
of [JPS93] carries out Lipschitz optimization by using the smallest Lipschitz
constant that is consistent with the observed data; although it works well in practice,
only asymptotic convergence can be guaranteed. The algorithm of [Hor06]
iterates over an increasing sequence of possible values of $L$; under an additional
assumption on the minimum increase in the neighborhood of the maximizers, it
guarantees a worst-case error of the order of $L^2 T^{-2/d}$ after taking $T$ samples of
the deterministic function $f$.

Adaptation to a global Lipschitz smoothness in bandit problems. We
provide in this paper a first step toward a general theory of adaptation. To
do so, we focus on the special case of classes $\mathcal{F}_L$ formed by all
environments $f$ that are $L$-Lipschitz with respect to the supremum norm over a
subset of $\mathbb{R}^d$: the hypercube $[0,1]^d$ for simplicity. This case covers
partially the settings of [BMSS11] and [KSU08], in which the Lipschitz constant was
equal to 1, a fact known by the algorithms. (Extensions to Hölderian-type
assumptions as in [Kle04, AOS07] will be considered in future work.)

As is well known, getting the minimax-optimal dependency on $T$ is easy; the
difficult part is getting that on $L$ without knowing the latter beforehand.

Our contributions. Our algorithm proceeds by discretization as in [Kle04].
To determine the correct discretization step, it first resorts to a uniform
exploration yielding a rather crude estimate of the Lipschitz constant (that is however
sufficient for our needs); in a second phase, it finds the optimal interval using
a standard exploration-exploitation strategy. Our main assumptions are
(essentially) that $f$ and its derivative are Lipschitz continuous in the hypercube.



We feel that this two-step approach can potentially be employed in more
general settings well beyond ours: with the notation above, the uniform-exploration
phase performs a model-selection step and recommends a class
$\mathcal{F}_{\widetilde{L}}$, which is used in the second phase to run a
continuum-armed bandit strategy tuned with the optimal parameters corresponding to
$\widetilde{L} \in \mathcal{L}$. However, for the sake of simplicity,
we study only a particular case of this general methodology.

Outline of the paper. In Section 2, we describe the setting and the classes
of environments of interest, establish a minimax lower bound on the achievable
performance (Section 2.1), and indicate how to achieve it when the global
Lipschitz parameter $L$ is known (Section 2.2). Our main contribution (Section 3)
is then a method to achieve it when the Lipschitz constant is unknown for a
slightly restricted class of Lipschitz functions.

2 Setting and notation 

We consider a $d$-dimensional compact set of arms, say, for simplicity,
$\mathcal{X} = [0,1]^d$, where $d \geq 1$. With each arm $x \in [0,1]^d$ is associated
a probability distribution $\nu_x$ with known bounded support, say $[0,1]$; this
defines an environment. A key quantity of such an environment is given by the
expectations $f(x)$ of the distributions $\nu_x$. They define a mapping
$f : [0,1]^d \to [0,1]$, which we call the mean-payoff function.

At each round $t \geq 1$, the player chooses an arm $I_t \in [0,1]^d$ and gets a
reward $Y_t$ sampled independently from $\nu_{I_t}$ (conditionally on the choice of
$I_t$). We call a strategy the (possibly randomized) rule that indicates at each
round which arm to pull given the history of past rewards.

We write the elements $x$ of $[0,1]^d$ in columns; $x^\top$ will thus denote a row
vector with $d$ elements.

Assumption 1. We assume that $f$ is twice differentiable, with Hessians
uniformly bounded by $M$ in the following sense: for all $x \in [0,1]^d$ and all
$y \in [0,1]^d$,

$$ \left| y^\top H_f(x)\, y \right| \leq M \|y\|_\infty^2 . $$

The $\ell^1$-norm of the gradient $\|\nabla f\|_1$ of $f$ is thus continuous and it
achieves its maximum on $[0,1]^d$, whose value is denoted by $L$. As a result, $f$ is
Lipschitz with respect to the $\ell^\infty$-norm with constant $L$ (and $L$ is the
smallest constant for which it is Lipschitz): for all $x, y \in [0,1]^d$,

$$ |f(x) - f(y)| \leq L \|x - y\|_\infty . $$

In the sequel we denote by $\mathcal{F}_{L,M}$ the set of environments whose
mean-payoff functions satisfy the above assumption.

We also denote by $\mathcal{F}_L$ the larger set of environments whose mean-payoff
function $f$ is only constrained to be $L$-Lipschitz with respect to the
$\ell^\infty$-norm.



Footnote 4: The proof of the approximation lemma will show why this is the case.



2.1 The minimax optimal orders of magnitude of the regret 



When $f$ is continuous, we denote by

$$ f^\star = \sup_{x \in [0,1]^d} f(x) = \max_{x \in [0,1]^d} f(x) $$

the largest expected payoff in a single round. The expected regret $R_T$ at round
$T$ is then defined as

$$ R_T = \mathbb{E}\left[ T f^\star - \sum_{t=1}^{T} Y_t \right] = \mathbb{E}\left[ T f^\star - \sum_{t=1}^{T} f(I_t) \right], $$

where we used the tower rule and where the expectations are with respect to
the random draws of the $Y_t$ according to the $\nu_{I_t}$ as well as to any
auxiliary randomization the strategy uses.

In this article, we are interested in controlling the worst-case expected
regret over all environments of $\mathcal{F}_L$. The following minimax lower bound
follows from a straightforward adaptation of the proof of [BMSS11, Theorem 13], which
is provided in Section A in appendix. (The adaptation is needed because the
hypothesis on the packing number is not exactly satisfied in the form stated
in [BMSS11].)

Theorem 1. For all strategies of the player and for all

$$ T \geq \max\left\{ L^d, \; \left( \frac{0.15\, L^{2/(d+2)}}{\max\{d, 2\}} \right)^{d+2} \right\}, $$

the worst-case regret over the set $\mathcal{F}_L$ of all environments that are
$L$-Lipschitz with respect to the $\ell^\infty$-norm is larger than

$$ \sup_{\mathcal{F}_L} R_T \geq 0.15\, L^{d/(d+2)}\, T^{(d+1)/(d+2)} . $$

The multiplicative constants are not optimized in this bound (a more careful 
proof might lead to a larger constant in the lower bound). 
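
To get a feel for the rate in Theorem 1, the value of the lower bound can be evaluated numerically. The following is only an illustrative sketch: the function name `minimax_lower_bound` is ours, and the condition on T required by the theorem is deliberately not checked.

```python
def minimax_lower_bound(L, T, d):
    """Value 0.15 * L^(d/(d+2)) * T^((d+1)/(d+2)) of the bound in Theorem 1.

    Illustrative helper (the name is an assumption, not from the text);
    the condition on T stated in the theorem is not checked here.
    """
    return 0.15 * L ** (d / (d + 2)) * T ** ((d + 1) / (d + 2))


if __name__ == "__main__":
    # The exponent (d+1)/(d+2) of T gets closer to 1 as the dimension grows.
    for d in (1, 2, 5):
        print(d, minimax_lower_bound(L=1.0, T=10_000, d=d))
```

For a fixed horizon, the bound grows with the dimension, which reflects the fact that the exponent of T approaches 1.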



2.2 How to achieve a minimax optimal regret when L is known 

In view of the previous section, our aim is to design strategies with worst-case
expected regret $\sup_{\mathcal{F}_L} R_T$ less than something of order
$L^{d/(d+2)} T^{(d+1)/(d+2)}$ when $L$ is unknown. A simple way to do so when $L$ is
known was essentially proposed in the introduction of [Kle04] (in the case
$d = 1$); it proceeds by discretizing the arm space. The argument is reproduced
below and can be used even when $L$ is unknown to recover the optimal dependency
$T^{(d+1)/(d+2)}$ on $T$ (but then, with a suboptimal dependency on $L$).

We consider the approximations $\bar f_m$ of $f$ with $m^d$ regular hypercube bins
in the $\ell^\infty$-norm, i.e., $m$ bins are formed in each direction and combined to
form the hypercubes. Each of these hypercube bins is indexed by an element
$k = (k_1, \ldots, k_d) \in \{0, \ldots, m-1\}^d$. The average value of $f$ over the
bin indexed by $k$ is denoted by



$$ \bar f_m(k) = m^d \int_{k/m + [0,1/m]^d} f(x)\, dx . $$
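
The bin averages can be approximated numerically when f is available in closed form. The sketch below is an illustration only: the helper name `bin_average` and the choice of a midpoint rule with `n` points per direction are our assumptions, not part of the text.

```python
import itertools


def bin_average(f, k, m, n=20):
    """Approximate the average of f over the bin k/m + [0, 1/m]^d.

    Uses a midpoint rule with n points per direction (an illustrative choice);
    f takes a list of d coordinates, k is a tuple of d integers in {0,...,m-1}.
    """
    d = len(k)
    total = 0.0
    for idx in itertools.product(range(n), repeat=d):
        # Midpoint of the sub-cell indexed by idx inside bin k.
        x = [(k[i] + (idx[i] + 0.5) / n) / m for i in range(d)]
        total += f(x)
    return total / n ** d
```

For instance, with d = 1, f(x) = x and m = 4, the bin indexed by k = 1 is [1/4, 1/2], over which the average of f is 3/8.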

We then consider the following two-stage strategy, which is based on some
strategy MAB for multi-armed bandits; MAB will refer to a generic strategy but
we will instantiate below the obtained bound. Knowing $L$ and assuming that $T$ is
fixed and known in advance, we may choose beforehand
$m = \lceil L^{2/(d+2)} T^{1/(d+2)} \rceil$.
The decomposition of $[0,1]^d$ into $m^d$ bins thus obtained will play the role of the
finitely many arms of the multi-armed bandit problem. At round $t \geq 1$, whenever
the MAB strategy prescribes to pull bin $K_t \in \{0, \ldots, m-1\}^d$, then first, an
arm $I_t$ is pulled uniformly at random in the hypercube $K_t/m + [0,1/m]^d$; and
second, given $I_t$, the reward $Y_t$ is drawn at random according to $\nu_{I_t}$.
Therefore, given $K_t$, the reward $Y_t$ has an expected value of $\bar f_m(K_t)$.
Finally, the reward $Y_t$ is returned to the underlying MAB strategy.

Strategy MAB is designed to control the regret with respect to the best of
the $m^d$ bins, which entails that

$$ \mathbb{E}\left[ T \max_{k} \bar f_m(k) - \sum_{t=1}^{T} Y_t \right] \leq \psi(T, m^d) $$

for some function $\psi$ that depends on MAB. Now, whenever $f$ is $L$-Lipschitz
with respect to the $\ell^\infty$-norm, we have that for all
$k \in \{0, \ldots, m-1\}^d$ and all $x \in k/m + [0,1/m]^d$, the difference
$|f(x) - \bar f_m(k)|$ is less than $L/m$; so that (see footnote 5)

$$ \max_{x \in [0,1]^d} f(x) - \max_{k} \bar f_m(k) \leq \frac{L}{m} . $$

All in all, for this MAB-based strategy, the regret is bounded by the sum of the
approximation term (at most $L/m$ per round) and of the regret term for
multi-armed bandits,

$$ \sup_{\mathcal{F}_L} R_T \leq T \frac{L}{m} + \psi(T, m^d) \leq L^{d/(d+2)} T^{(d+1)/(d+2)} + \psi\left( T, \left\lceil L^{2/(d+2)} T^{1/(d+2)} \right\rceil^d \right) . \qquad (1) $$

We now instantiate this bound. 

Footnote 5: Here, one could object that we only use the local Lipschitzness of $f$
around the point where it achieves its maximum; however, we need $f$ to be
$L$-Lipschitz in a $1/m$-neighborhood of this maximum, but the optimal value of $m$
depends on $L$. To solve the chicken-egg problem, we restricted our attention to
globally $L$-Lipschitz functions, which, anyway, in view of Theorem 1, comes at no
cost as far as minimax-optimal orders of magnitude of the regret bounds in $L$ and
$T$ are considered.



The INF strategy of [AB10] (see also [ABL11]) achieves
$\psi(T, m') = 2\sqrt{2Tm'}$ and this entails a final
$O\left( L^{d/(d+2)} T^{(d+1)/(d+2)} \right)$ bound in (1). Note that for the
EXP3 strategy of [ACBFS02] or the UCB strategy of [ACBF02], extra logarithmic
terms of the order of $\ln T$ would appear in the bound.
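
The choice of m and the resulting bound (1) with the INF guarantee are easy to evaluate. The sketch below is only illustrative and assumes L and T are known; the function names are ours.

```python
import math


def bins_known_L(L, T, d):
    """Number of bins per direction, m = ceil(L^(2/(d+2)) * T^(1/(d+2)))."""
    return math.ceil(L ** (2 / (d + 2)) * T ** (1 / (d + 2)))


def regret_bound_known_L(L, T, d):
    """Right-hand side T*L/m + psi(T, m^d) with psi(T, m') = 2*sqrt(2*T*m').

    Hypothetical helper names; psi is the INF guarantee recalled in the text.
    """
    m = bins_known_L(L, T, d)
    approximation = T * L / m                # discretization error over T rounds
    mab = 2 * math.sqrt(2 * T * m ** d)      # regret of INF on m^d arms
    return approximation + mab
```

Since m is at least $L^{2/(d+2)} T^{1/(d+2)}$, the approximation term never exceeds $L^{d/(d+2)} T^{(d+1)/(d+2)}$, as in (1).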



3 Achieving a minimax optimal regret not knowing L 

In this section, our aim is to obtain a worst-case regret bound of the
minimax-optimal order of $L^{d/(d+2)} T^{(d+1)/(d+2)}$ even when $L$ is unknown. To do
so, it will be useful to first estimate $L$; we will provide a (rather crude)
estimate suited to our needs, as our goal is the minimization of the regret rather
than the best possible estimation of $L$. Our method is based on the following
approximation results.

For the estimation to be efficient, it will be convenient to restrict our attention
to the subset $\mathcal{F}_{L,M}$ of $\mathcal{F}_L$, i.e., we will consider the
additional assumptions on the existence and boundedness of the Hessians asserted in
Assumption 1. However, the obtained regret bound (Theorem 2) will suffer from some
(light) dependency on $M$ but will have the right orders of magnitude in $T$ and
$L$; the forecaster used to achieve it depends neither on $L$ nor on $M$ and is
fully adaptive.



3.1 Some preliminary approximation results 

We still consider the approximations $\bar f_m$ of $f$ over $[0,1]^d$ with $m^d$
regular bins. We then introduce the following approximation of $L$:

$$ \bar L_m = m \max_{k \in \{1, \ldots, m-2\}^d} \; \max_{s \in \{-1,1\}^d} \left| \bar f_m(k) - \bar f_m(k+s) \right| . $$

This quantity provides a fairly good approximation of the Lipschitz constant,
since $m \left( \bar f_m(k) - \bar f_m(k+s) \right)$ is an estimation of the
(average) derivative of $f$ in bin $k$ and direction $s$.
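
The double maximum over interior bins and diagonal directions can be sketched in a few lines. This is an illustration under our own conventions: the bin averages are assumed to be stored in a dictionary `fbar` keyed by tuples in {0,...,m-1}^d, and the function name is hypothetical.

```python
import itertools


def lipschitz_proxy(fbar, m, d):
    """m * max over k in {1,...,m-2}^d and s in {-1,1}^d of |fbar[k] - fbar[k+s]|.

    fbar maps each bin index (a tuple of d integers in {0,...,m-1}) to the
    average payoff of that bin; requires m >= 3 so that interior bins exist.
    """
    best = 0.0
    for k in itertools.product(range(1, m - 1), repeat=d):
        for s in itertools.product((-1, 1), repeat=d):
            neighbour = tuple(ki + si for ki, si in zip(k, s))
            best = max(best, abs(fbar[k] - fbar[neighbour]))
    return m * best
```

For f(x) = x_1 + x_2 in dimension 2, the bin averages are (k_1 + k_2 + 1)/m and the proxy equals 2, which is the l^1-norm of the gradient; the diagonal directions s are what make the l^1-norm appear.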

The lemma below relates precisely $\bar L_m$ to $L$: as $m$ increases, $\bar L_m$
converges to $L$.

Lemma 1. If $f \in \mathcal{F}_{L,M}$ and $m \geq 3$, then

$$ L - \frac{7M}{m} \leq \bar L_m \leq L . $$

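Proof. We note that for all $k \in \{1, \ldots, m-2\}^d$ and $s \in \{-1,1\}^d$, we
have by definition

$$ \left| \bar f_m(k) - \bar f_m(k+s) \right| = m^d \left| \int_{k/m + [0,1/m]^d} \bigl( f(x) - f(x + s/m) \bigr)\, dx \right| \leq m^d \int_{k/m + [0,1/m]^d} \bigl| f(x) - f(x + s/m) \bigr|\, dx . $$

Now, since $f$ is $L$-Lipschitz in the $\ell^\infty$-norm, it holds that

$$ \bigl| f(x) - f(x + s/m) \bigr| \leq L \, \| s/m \|_\infty = \frac{L}{m} ; $$

integrating this bound entails the stated upper bound $L$ on $\bar L_m$.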

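For the lower bound, we first denote by $x^\star \in [0,1]^d$ a point such that
$\|\nabla f(x^\star)\|_1 = L$. (Such a point always exists, see Assumption 1.) This
point belongs to some bin in $\{0, \ldots, m-1\}^d$; however, the closest bin
$k_m^\star$ in $\{1, \ldots, m-2\}^d$ is such that

$$ \forall x \in k_m^\star/m + [0,1/m]^d, \qquad \| x - x^\star \|_\infty \leq \frac{2}{m} . \qquad (2) $$

Note that this bin $k_m^\star$ is such that all $k_m^\star + s$ belong to
$\{0, \ldots, m-1\}^d$ and hence legally index hypercube bins, when
$s \in \{-1,1\}^d$. Now, let $s_m^\star \in \{-1,1\}^d$ be such that

$$ \nabla f(x^\star) \cdot s_m^\star = \|\nabla f(x^\star)\|_1 = L , \qquad (3) $$

where $\cdot$ denotes the inner product in $\mathbb{R}^d$. By the definition of
$\bar L_m$ as some maximum,

$$ \bar L_m \geq m \left| \bar f_m(k_m^\star) - \bar f_m(k_m^\star + s_m^\star) \right| = m \times m^d \left| \int_{k_m^\star/m + [0,1/m]^d} \bigl( f(x) - f(x + s_m^\star/m) \bigr)\, dx \right| . \qquad (4) $$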



Now, Taylor's theorem (in the mean-value form for real-valued twice
differentiable functions of possibly several variables) shows that for any
$x \in k_m^\star/m + [0,1/m]^d$, there exist two elements $\xi$ and $\zeta$,
belonging respectively to the segments between $x$ and $x^\star$, on the one hand,
between $x^\star$ and $x + s_m^\star/m$ on the other hand, such that

$$ f(x) - f(x + s_m^\star/m) = \bigl( f(x) - f(x^\star) \bigr) + \bigl( f(x^\star) - f(x + s_m^\star/m) \bigr) $$

$$ = \nabla f(x^\star) \cdot (x - x^\star) + \frac{1}{2} (x - x^\star)^\top H_f(\xi)\, (x - x^\star) - \nabla f(x^\star) \cdot (x + s_m^\star/m - x^\star) - \frac{1}{2} (x + s_m^\star/m - x^\star)^\top H_f(\zeta)\, (x + s_m^\star/m - x^\star) $$

$$ = - \nabla f(x^\star) \cdot \frac{s_m^\star}{m} + \frac{1}{2} (x - x^\star)^\top H_f(\xi)\, (x - x^\star) - \frac{1}{2} (x + s_m^\star/m - x^\star)^\top H_f(\zeta)\, (x + s_m^\star/m - x^\star) . $$

Using (3) and substituting the bound on the Hessians stated in Assumption 1,
we get

$$ f(x) - f(x + s_m^\star/m) \leq - \frac{L}{m} + \frac{M}{2} \| x - x^\star \|_\infty^2 + \frac{M}{2} \| x + s_m^\star/m - x^\star \|_\infty^2 ; $$

substituting (2), we get

$$ f(x) - f(x + s_m^\star/m) \leq - \frac{L}{m} + \frac{M}{2m^2} \left( 2^2 + 3^2 \right) \leq - \frac{L}{m} + \frac{7M}{m^2} \leq 0 , $$

where the last inequality holds with no loss of generality (if it does not, then the
lower bound on $\bar L_m$ in the statement of the lemma is trivial). Substituting and
integrating this inequality in (4) and using the triangle inequality, we get

$$ \bar L_m \geq L - \frac{7M}{m} . $$

This concludes the proof. □
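
As a sanity check of the two inequalities relating the approximation to L (lower bound L - 7M/m, upper bound L), here is a worked example in dimension d = 1. The function f below is our own choice for illustration; it is not taken from the text. It takes values in [1/2, 3/4], and its bin averages and the resulting approximation of L can be computed in closed form.

```latex
% Illustrative example (our choice of f), d = 1:
f(x) = \tfrac12 + \tfrac14 x^2, \qquad
L = \max_{x \in [0,1]} |f'(x)| = \tfrac12, \qquad
M = \tfrac12 \quad (\text{since } |y f''(x) y| = y^2/2).

% Bin averages and consecutive differences:
\bar f_m(k) = m \int_{k/m}^{(k+1)/m} f(x)\,dx = \tfrac12 + \frac{3k^2 + 3k + 1}{12\,m^2},
\qquad
\bar f_m(k+1) - \bar f_m(k) = \frac{k+1}{2m^2}.

% The maximum over k in {1,...,m-2} and s in {-1,1} is attained at k = m-2, s = +1:
\bar L_m = m \cdot \frac{m-1}{2m^2} = \tfrac12 \Bigl( 1 - \tfrac1m \Bigr),
\qquad
L - \frac{7M}{m} = \tfrac12 - \frac{7}{2m} \;\le\; \bar L_m \;\le\; \tfrac12 = L.
```

In this example the approximation error is exactly 1/(2m), well within the 7M/m allowed by the lower bound.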



3.2 A strategy in two phases 

Our strategy is described in Figure 1; several pieces of notation that will be used
in the statements and proofs of some results below are defined therein. Note that we
proceed in two phases: a pure exploration phase, when we estimate $L$ by some
$\widetilde L_m$, and an exploration-exploitation phase, when we use a strategy
designed for the case of finitely-armed bandits on a discretized version of the arm
space. (The discretization step depends on the estimate obtained in the pure
exploration phase.)

The first step in the analysis is to relate $\widetilde L_m$ and $\widehat L_m$ to
the quantity they are estimating, namely $\bar L_m$.

Lemma 2. With probability at least $1 - \delta$,

$$ \left| \widehat L_m - \bar L_m \right| \leq m \sqrt{\frac{2}{E} \ln \frac{2 m^d}{\delta}} . $$

Proof. We consider first a fixed $k \in \{0, \ldots, m-1\}^d$; as already used in
Section 2.2, the $Z_{k,j}$ are independent and identically distributed according to a
distribution on $[0,1]$ with expectation $\bar f_m(k)$, as $j$ varies between 1 and
$E$. Therefore, by Hoeffding's inequality, with probability at least
$1 - \delta/m^d$,

$$ \left| \widehat\mu_k - \bar f_m(k) \right| \leq \sqrt{\frac{1}{2E} \ln \frac{2 m^d}{\delta}} . $$

Performing a union bound and using the triangle inequality, we get that with
probability at least $1 - \delta$,

$$ \forall k, k' \in \{0, \ldots, m-1\}^d, \qquad \Bigl| \, \bigl| \widehat\mu_k - \widehat\mu_{k'} \bigr| - \bigl| \bar f_m(k) - \bar f_m(k') \bigr| \, \Bigr| \leq \sqrt{\frac{2}{E} \ln \frac{2 m^d}{\delta}} . $$

This entails the claimed bound. □
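
The deviation term obtained from Hoeffding's inequality and the union bound is simple to evaluate, which helps when choosing E for a given m. The snippet is a sketch with a hypothetical function name.

```python
import math


def deviation(m, d, E, delta):
    """Deviation term m * sqrt((2 / E) * ln(2 * m^d / delta)).

    m: bins per direction, d: dimension, E: pulls per bin,
    delta: allowed failure probability (all names are illustrative).
    """
    return m * math.sqrt(2.0 / E * math.log(2 * m ** d / delta))
```

Multiplying E by four halves the deviation, while the dependency on the confidence level is only logarithmic.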

By combining Lemmas 1 and 2 we get the following inequalities on $\widetilde L_m$,
since the latter is obtained from $\widehat L_m$ by adding a deviation term.



Parameters:

- Number $T$ of rounds;
- Number $m$ of bins (in each direction) considered in the pure exploration phase;
- Number $E$ of times each of them must be pulled;
- A multi-armed bandit strategy MAB (taking as inputs a number $m^d$ of arms and
  possibly other parameters).

Pure exploration phase:

1. For each $k \in \{0, \ldots, m-1\}^d$
   - pull $E$ arms independently uniformly at random in $k/m + [0,1/m]^d$ and get
     $E$ associated rewards $Z_{k,j}$, where $j \in \{1, \ldots, E\}$;
   - compute the average reward for bin $k$,
     $$ \widehat\mu_k = \frac{1}{E} \sum_{j=1}^{E} Z_{k,j} ; $$

2. Set
   $$ \widehat L_m = m \max_{k \in \{1, \ldots, m-2\}^d} \; \max_{s \in \{-1,1\}^d} \left| \widehat\mu_k - \widehat\mu_{k+s} \right| $$
   and define $\widetilde L_m = \widehat L_m + m \sqrt{\frac{2}{E} \ln(2 m^d T)}$ as
   well as $\widehat m = \left\lceil \widetilde L_m^{2/(d+2)} T^{1/(d+2)} \right\rceil$.

Exploration-exploitation phase:

Run the strategy MAB with $\widehat m^d$ arms as follows; for all
$t = E m^d + 1, \ldots, T$,

1. If MAB prescribes to play arm $K_t \in \{0, \ldots, \widehat m - 1\}^d$, pull an
   arm $I_t$ at random in $K_t/\widehat m + [0, 1/\widehat m]^d$;
2. Observe the associated payoff $Y_t$, drawn independently according to
   $\nu_{I_t}$;
3. Return $Y_t$ to the strategy MAB.

Fig. 1. The considered strategy.
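
The two phases of Figure 1 can be simulated as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: rewards are taken to be Bernoulli with mean f(x), UCB1 is used as a stand-in for the generic MAB strategy (the analysis in the text instantiates MAB with INF), and all function names are ours.

```python
import itertools
import math
import random


def sample_in_bin(k, m, rng):
    """Draw an arm uniformly at random in the bin k/m + [0, 1/m]^d."""
    return [(ki + rng.random()) / m for ki in k]


def two_phase(f, d, T, m, E, rng):
    """Run both phases; return (L_tilde, m_hat, total reward).

    Assumptions: Bernoulli rewards with mean f(x); UCB1 replaces MAB;
    the caller chooses m >= 3 and E with E * m^d <= T.
    """
    def pull(x):
        return 1.0 if rng.random() < f(x) else 0.0

    total = 0.0
    # Pure exploration phase: E pulls in each of the m^d bins.
    mu_hat = {}
    for k in itertools.product(range(m), repeat=d):
        rewards = [pull(sample_in_bin(k, m, rng)) for _ in range(E)]
        total += sum(rewards)
        mu_hat[k] = sum(rewards) / E
    L_hat = 0.0
    for k in itertools.product(range(1, m - 1), repeat=d):
        for s in itertools.product((-1, 1), repeat=d):
            nb = tuple(ki + si for ki, si in zip(k, s))
            L_hat = max(L_hat, m * abs(mu_hat[k] - mu_hat[nb]))
    # Inflate the estimate by the deviation term, then pick the number of bins.
    L_tilde = L_hat + m * math.sqrt(2.0 / E * math.log(2 * m ** d * T))
    m_hat = math.ceil(L_tilde ** (2 / (d + 2)) * T ** (1 / (d + 2)))

    # Exploration-exploitation phase: UCB1 over the m_hat^d bins.
    bins = list(itertools.product(range(m_hat), repeat=d))
    counts = [0] * len(bins)
    sums = [0.0] * len(bins)
    for n in range(1, T - E * m ** d + 1):
        if n <= len(bins):
            a = n - 1  # play every bin once before using the index
        else:
            a = max(range(len(bins)),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(n) / counts[i]))
        y = pull(sample_in_bin(bins[a], m_hat, rng))
        counts[a] += 1
        sums[a] += y
        total += y
    return L_tilde, m_hat, total
```

A possible call is `two_phase(lambda x: 0.5 + 0.3 * x[0], d=1, T=5000, m=4, E=200, rng=random.Random(0))`. Note that the list of bins has size $\widehat m^d$, so this direct enumeration is only practical in low dimension.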



Corollary 1. If $f \in \mathcal{F}_{L,M}$ and $m \geq 3$, then, with probability at
least $1 - 1/T$,

$$ L - \frac{7M}{m} \leq \widetilde L_m \leq L + 2m \sqrt{\frac{2}{E} \ln(2 m^d T)} . $$

We state a last intermediate result; it relates the regret of the strategy of
Figure 1 to the regret of the strategy MAB that it takes as a parameter.

Lemma 3. Let $\psi(T', m')$ be a distribution-free upper bound on the expected
regret of the strategy MAB, when run for $T'$ rounds on a multi-armed bandit
problem with $m'$ arms, to which payoff distributions over $[0,1]$ are associated. The
expected regret of the strategy defined in Figure 1 is then bounded from above as

$$ \sup_{\mathcal{F}_L} R_T \leq E m^d + \mathbb{E}\left[ \frac{L T}{\widehat m} + \psi\left( T - E m^d, \widehat m^d \right) \right] . $$

Proof. As all payoffs lie in $[0,1]$, the regret during the pure exploration phase is
bounded by the total length $E m^d$ of this phase.

Now, we bound the (conditionally) expected regret of the MAB strategy
during the exploration-exploitation phase; the conditional expectation is with
respect to the pure exploration phase and is used to fix the value of $\widehat m$.
Using the same arguments as in Section 2.2, the regret during this phase, which lasts
$T - E m^d$ rounds, is bounded against any environment in $\mathcal{F}_L$ by

$$ L \, \frac{T - E m^d}{\widehat m} + \psi\left( T - E m^d, \widehat m^d \right) . $$

The tower rule concludes the proof. □

We are now ready to state our main result. Note that it is somewhat
unsatisfactory as the main regret bound (5) could only be obtained on the restricted
class $\mathcal{F}_{L,M}$ and depends (in a light manner, see comments below) on the
parameter $M$, while having the optimal orders of magnitude in $T$ and $L$. These
drawbacks might be artifacts of the analysis and could be circumvented,
perhaps in the light of the proof of the lower bound (Theorem 1), which exhibits
the worst-case elements of $\mathcal{F}_L$ (they seem to belong to some set
$\mathcal{F}_{L,M_L}$, where $M_L$ is a decreasing function of $L$).

Note however that the dependency on $M$ in (5) is in the additive form. On the
other hand, by adapting the argument of Section 2.2, one can prove
distribution-free regret bounds on $\mathcal{F}_{L,M}$ with an improved order of
magnitude as far as $T$ is concerned but at the cost of getting $M$ as a
multiplicative constant in the picture. While the corresponding bound might be better
in some regimes, we consider here $\mathcal{F}_{L,M}$ instead of $\mathcal{F}_L$
essentially to remove pathological functions, and as such we want the weakest
dependency on $M$.






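Theorem 2. When used with the multi-armed strategy INF, the strategy of
Figure 1 ensures that

$$ \sup_{\mathcal{F}_{L,M}} R_T \leq T^{(d+1)/(d+2)} \left( 9\, L^{d/(d+2)} + 5 \left( 2m \sqrt{\frac{2}{E} \ln(2 T^{d+1})} \right)^{d/(d+2)} \right) + E m^d + 2\sqrt{2 T d^d} + 1 \qquad (5) $$

as soon as

$$ m \geq \frac{8M}{L} . \qquad (6) $$

In particular, for

$$ 0 < \gamma < \frac{d(d+1)}{(3d+2)(d+2)} \qquad \text{and} \qquad \alpha = \frac{1}{d+2} \left( \frac{d+1}{d+2} - \gamma \, \frac{3d+2}{d} \right) > 0 , \qquad (7) $$

the choices of $m = \lfloor T^\alpha \rfloor$ and
$E = m^2 \left\lceil T^{2\gamma(d+2)/d} \right\rceil$ yield the bound

$$ \sup_{\mathcal{F}_{L,M}} R_T \leq \max\left\{ \left( \frac{8M}{L} + 1 \right)^{1/\alpha} , \; L^{d/(d+2)}\, T^{(d+1)/(d+2)} \bigl( 9 + \varepsilon(T, d) \bigr) \right\} , \qquad (8) $$

where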



«r,0 - 5T-(ln(2I-)) J «-«' + 7- + 

vanishes as T tends to infinity. 

Note that the choices of $E$ and $m$ solely depend on $T$, which may however
be unknown in advance; standard arguments, like the doubling trick, can be
used to circumvent the issue, at a minor cost given by an additional constant
multiplicative factor in the bound.
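
The doubling trick mentioned here restarts the strategy on successive epochs of lengths 1, 2, 4, and so on, each time tuning the parameters as if the epoch length were the horizon. The generator below is a generic sketch of the epoch schedule (the name is ours); it does not reproduce any specific construction from the text.

```python
def doubling_epochs(total_rounds):
    """Yield epoch lengths 1, 2, 4, ... covering total_rounds rounds.

    The last epoch is truncated so that the lengths sum to total_rounds.
    """
    played, r = 0, 0
    while played < total_rounds:
        length = min(2 ** r, total_rounds - played)
        yield length
        played += length
        r += 1
```

On each epoch of length T_r, one would run the two-phase strategy afresh with m and E chosen as functions of T_r only.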

Remark 1. There is a trade-off between the value of the constant term in the
maximum, $(1 + 8M/L)^{1/\alpha}$, and the convergence rate of the vanishing term
$\varepsilon(T, d)$ toward 0, which is of order $\gamma$. For instance, in the case
$d = 1$, the condition on $\gamma$ is $0 < \gamma < 2/15$; as an illustration, we get

- a constant term of a reasonable size, since $1/\alpha \leq 4.87$, when the
  convergence rate is small, $\gamma = 0.01$;

- a much larger constant, since $1/\alpha = 60$, when the convergence rate is
  faster, $\gamma = 2/15 - 0.01$.
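
The numerical values quoted in this remark can be reproduced directly. The sketch below assumes that the admissible range is $\gamma < d(d+1)/((3d+2)(d+2))$ and that $\alpha = \frac{1}{d+2}\bigl(\frac{d+1}{d+2} - \gamma\,\frac{3d+2}{d}\bigr)$, which is our reading of condition (7); the function names are ours.

```python
def gamma_upper(d):
    """Upper end of the admissible range for gamma (our reading of (7))."""
    return d * (d + 1) / ((3 * d + 2) * (d + 2))


def alpha(gamma, d):
    """Exponent alpha in m = floor(T^alpha) as a function of gamma (our reading of (7))."""
    return ((d + 1) / (d + 2) - gamma * (3 * d + 2) / d) / (d + 2)


if __name__ == "__main__":
    for g in (0.01, 2 / 15 - 0.01):
        print(g, 1 / alpha(g, 1))
```

With d = 1 this gives an upper end of 2/15 for $\gamma$, and values of $1/\alpha$ close to 4.86 and 60 for the two choices of $\gamma$ discussed above.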



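Proof. For the strategy INF, as recalled above, $\psi(T', m') = 2\sqrt{2 T' m'}$. The
bound of Lemma 3 can thus be instantiated as

$$ \sup R_T \leq E m^d + \mathbb{E}\left[ \frac{L T}{\widehat m} + 2\sqrt{2 T \widehat m^d} \right] . $$

We now substitute the definition

$$ \widehat m = \left\lceil \widetilde L_m^{2/(d+2)} T^{1/(d+2)} \right\rceil \leq \widetilde L_m^{2/(d+2)} T^{1/(d+2)} \left( 1 + \frac{1}{\widetilde L_m^{2/(d+2)} T^{1/(d+2)}} \right) $$

and separate the cases depending on whether $\widetilde L_m^{2/(d+2)} T^{1/(d+2)}$ is
smaller or larger than $d$ to handle the second term in the expectation. In the first
case, we simply bound $\widehat m$ by $d$ and get a $2\sqrt{2 T d^d}$ term. When the
quantity of interest is larger than $d$, then we get the central term in the
expectation below by using the fact that $(1 + 1/x)^d \leq (1 + 1/d)^d \leq e$
whenever $x \geq d$. That is, $\sup R_T$ is less than

$$ E m^d + \mathbb{E}\left[ T^{(d+1)/(d+2)} \frac{L}{\widetilde L_m^{2/(d+2)}} + 2\sqrt{ 2 T e \left( T^{1/(d+2)} \widetilde L_m^{2/(d+2)} \right)^d } + 2\sqrt{2 T d^d} \right] . $$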

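We will now use the lower and upper bounds on $\widetilde L_m$ stated by
Corollary 1. In the sequel we will make repeated use of the following inequality
linking $\alpha$-norms and 1-norms: for all integers $p$, all
$u_1, \ldots, u_p > 0$, and all $\alpha \in [0,1]$,

$$ (u_1 + \ldots + u_p)^\alpha \leq u_1^\alpha + \ldots + u_p^\alpha . \qquad (9) $$

By resorting to (9), we get that with probability at least $1 - 1/T$,

$$ 2\sqrt{ 2 T e \left( T^{1/(d+2)} \widetilde L_m^{2/(d+2)} \right)^d } = 2\sqrt{2e}\; T^{(d+1)/(d+2)}\, \widetilde L_m^{d/(d+2)} $$

$$ \leq 2\sqrt{2e}\; T^{(d+1)/(d+2)} \left( L + 2m \sqrt{\frac{2}{E} \ln(2 m^d T)} \right)^{d/(d+2)} $$

$$ \leq 2\sqrt{2e}\; T^{(d+1)/(d+2)} \left( L^{d/(d+2)} + \left( 2m \sqrt{\frac{2}{E} \ln(2 m^d T)} \right)^{d/(d+2)} \right) . $$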

On the other hand, with probability at least $1 - 1/T$,

$$ \widetilde L_m \geq L - \frac{7M}{m} \geq \frac{L}{8} , $$

where we assumed that $m$ and $E$ are chosen large enough for the lower bound
of Corollary 1 to be larger than $L/8$. This is indeed the case as soon as

$$ \frac{7M}{m} \leq \frac{7L}{8} , \qquad \text{that is,} \qquad m \geq \frac{8M}{L} , $$

which is exactly the condition (6).

Putting all things together (and bounding $m$ by $T$ in the logarithm), with
probability at least $1 - 1/T$, the regret is less than

$$ E m^d + T^{(d+1)/(d+2)} \frac{L}{(L/8)^{2/(d+2)}} + 2\sqrt{2 T d^d} + 2\sqrt{2e}\; T^{(d+1)/(d+2)} \left( L^{d/(d+2)} + \left( 2m \sqrt{\frac{2}{E} \ln(2 T^{d+1})} \right)^{d/(d+2)} \right) ; \qquad (10) $$

on the event of probability smaller than $\delta = 1/T$ where the above bound does
not necessarily hold, we upper bound the regret by $T$. Therefore, the expected
regret is bounded by (10) plus 1. Bounding the constants as
$8^{2/(d+2)} \leq 8^{2/3} = 4$ and $2\sqrt{2e} \leq 5$ concludes the proof of the
first part of the theorem.



The second part follows by substituting the values of $E$ and $m$ in the
expression above and by bounding the regret by $T_0$ for the time steps $t \leq T_0$
for which the condition (6) is not satisfied.

More precisely, the regret bound obtained in the first part is of the desired
order $L^{d/(d+2)} T^{(d+1)/(d+2)}$ only if $E \gg m^2$ and
$E m^d \ll T^{(d+1)/(d+2)}$. This is why we looked for suitable values of $m$ and $E$
in the following form:

$$ m = \lfloor T^\alpha \rfloor \qquad \text{and} \qquad E = m^2 \left\lceil T^{2\gamma(d+2)/d} \right\rceil , $$

where $\alpha$ and $\gamma$ are positive. We choose $\alpha$ as a function of
$\gamma$ so that the terms

$$ E m^d = \left( \lfloor T^\alpha \rfloor \right)^{d+2} \left\lceil T^{2\gamma(d+2)/d} \right\rceil $$

and

$$ T^{(d+1)/(d+2)} \left( \frac{m}{\sqrt{E}} \right)^{d/(d+2)} = T^{(d+1)/(d+2)} \left( \left\lceil T^{2\gamma(d+2)/d} \right\rceil \right)^{-d/(2(d+2))} $$

are approximately balanced; for instance, such that

$$ \alpha(d+2) + 2\gamma(d+2)/d = (d+1)/(d+2) - \gamma , $$

which yields the proposed expression (7). The fact that $\alpha$ needs to be positive
entails the constraint on $\gamma$ given in (7).

When condition (6) is met, we substitute the values of $m$ and $E$ into (5)
to obtain the bound (8); the only moment in this substitution when taking the
upper or lower integer parts does not help is for the term $E m^d$, for which we
write (using that $T \geq 1$)

$$ E m^d = m^{d+2} \left\lceil T^{2\gamma(d+2)/d} \right\rceil \leq T^{\alpha(d+2)} \left( 1 + T^{2\gamma(d+2)/d} \right) \leq 2\, T^{\alpha(d+2)}\, T^{2\gamma(d+2)/d} = 2\, T^{(d+1)/(d+2) - \gamma} . $$

When condition (6) is not met, which can only be the case when $T$ is such
that $T^\alpha < 1 + 8M/L$, that is, $T < T_0 = (1 + 8M/L)^{1/\alpha}$, we upper bound
the regret by $T_0$. □
A Proof of Theorem 1

(Omitted from the Proceedings of ALT'11)

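Proof. We slightly adapt the (end of the) proof of [BMSS11, Theorem 13]; we
take the metric $\ell(x, y) = L \|x - y\|_\infty$. For $\varepsilon \in (0, 1/2)$, the
$\varepsilon$-packing number of $[0,1]^d$ with respect to $\ell$ equals

$$ \mathcal{N}\left( [0,1]^d, \ell, \varepsilon \right) = \left( \lfloor L/\varepsilon \rfloor \right)^d \geq 2 $$

provided that $L/\varepsilon \geq 2$, that is, $\varepsilon \leq L/2$. Therefore,
Step 5 of the mentioned proof shows that

$$ \sup R_T \geq T \varepsilon \left( 0.5 - 2.2\, \varepsilon \sqrt{ \frac{T}{\left( \lfloor L/\varepsilon \rfloor \right)^d} } \right) $$

for all $0 < \varepsilon < \min\{1, L\}/2$. We now optimize this bound.
Whenever $L/\varepsilon \geq \max\{d, 2\}$, we have

$$ \lfloor L/\varepsilon \rfloor \geq L/\varepsilon - 1 \geq \frac{1}{2}\, L/\varepsilon $$

in the case where $d = 1$, while for $d \geq 2$,

$$ \left( \lfloor L/\varepsilon \rfloor \right)^d \geq \left( L/\varepsilon - 1 \right)^d \geq \frac{1}{4} \left( L/\varepsilon \right)^d , $$

where we used the fact that $(1 - 1/x)^d \geq (1 - 1/d)^d \geq (1 - 1/2)^2 = 1/4$ for
all $x \geq d$ and $d \geq 2$.

Therefore, whenever $0 < \varepsilon < \min\{1/2, L/d, L/2\}$,

$$ \sup R_T \geq T \varepsilon \left( 0.5 - 4.4\, \varepsilon^{1 + d/2} \sqrt{ \frac{T}{L^d} } \right) . $$

We take $\varepsilon$ of the form

$$ \varepsilon = \gamma\, L^{d/(d+2)}\, T^{-1/(d+2)} $$

for some constant $\gamma < 1$ to be defined later on. The lower bound then equals

$$ \gamma\, L^{d/(d+2)} T^{(d+1)/(d+2)} \left( 0.5 - 4.4\, \gamma^{1 + d/2} \right) \geq \gamma\, L^{d/(d+2)} T^{(d+1)/(d+2)} \left( 0.5 - 4.4\, \gamma^{3/2} \right) , $$

where we used the fact that $\gamma < 1$ and $d \geq 1$ for the last inequality.
Taking $\gamma$ such that $0.5 - 4.4\, \gamma^{3/2} = 1/4$, that is,
$\gamma = 1/(4 \times 4.4)^{2/3} \geq 0.14$, we get the stated bound.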
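
The value of the constant chosen at the end of this argument is easy to check numerically. The snippet below is a small sketch; the function name is ours.

```python
def gamma_constant():
    """Solve 0.5 - 4.4 * gamma^(3/2) = 1/4 for gamma, i.e. gamma = (4 * 4.4)^(-2/3)."""
    return 1.0 / (4 * 4.4) ** (2.0 / 3.0)


if __name__ == "__main__":
    g = gamma_constant()
    # Print gamma and the residual of the defining equation (should be near zero).
    print(g, 0.5 - 4.4 * g ** 1.5 - 0.25)
```

The computed value lies between 0.14 and 0.15, in line with the two bounds on this constant used in the proof.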

It only remains to see that the indicated condition on $T$ proceeds from the
value of $\varepsilon$ provided above, the constraint
$\varepsilon < \min\{1/2, L/d, L/2\}$, and the upper bound $\gamma \leq 0.15$. □



